Final Report

1. Overview

1.1 Problem

Our project analyzes hate content targeting Asian groups on Twitter and Reddit. There are two dimensions to our project: the content itself, and the users who produce or endorse it.

1.2 Objective Review

  1. Describe the characteristics of hate speech against the Asian community on Twitter and Reddit
  2. Identify the behavior characteristics of the users who publish and/or endorse hate speech
  3. Compare the policies and community culture towards the Asian community on Twitter and Reddit
  4. Stretch objective: describe the evolution of hate speech on Twitter and Reddit

Due to time limitations, we were only able to accomplish our first and second goals. We touched on the third goal during our analysis, but found the scope too large to finish in the time frame: it is difficult to measure culture and policies using our current data, and doing so raises some privacy issues.

1.3 Context

We formulated this project because we have seen an increasing amount of hate crime targeting Asian groups since the pandemic. The discrimination also worsened on social platforms after Trump posted hateful content on Twitter. We hope our project can help people become aware of the situation.


2. Data Preparation

2.1 Data Collection

We used the following keywords to search for the data: "China virus", "China flu", "Kung flu", "Wuhan virus", "#fuckchina", "#chinaisasshoe", "#chinaisterrorist", "#boycottchina", "#blamechina", "#MakeChinaPay", "yellow invader", "rice nigger", "spink", "sideways vaginas/sideways vagina", "chinig", "paki", "Chinese wetback", "Dink". These keywords were extracted from two papers (Ziems et al. 2020; Vidgen et al. 2020) and an online database.

Twitter

Reddit

2.2 Data Cleaning

3. Findings

Q1: Word Frequency

The first question we want to analyze is word frequency across all posts. We hope it gives us a general sense of what the problem looks like. The analysis is conducted on each platform separately and in aggregate. Specifically, we have the following sub-questions that we want to answer through our analysis.

Preprocessing

  1. Change the data types to datetime and string
  2. Split the dataframe into sub-tables based on platform and on before/after COVID
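The two steps above can be sketched with pandas as follows. The column names (`created_at`, `text`, `platform`), the example rows, and the 2020-03-11 cutoff date are illustrative assumptions, not the project's exact code.

```python
import pandas as pd

# Illustrative combined table; column names and rows are assumptions.
posts = pd.DataFrame({
    "created_at": ["2019-12-01", "2020-05-20", "2020-06-02"],
    "text": ["example post one", "example post two", "example post three"],
    "platform": ["twitter", "reddit", "twitter"],
})

# 1. Cast the columns to datetime and string dtypes.
posts["created_at"] = pd.to_datetime(posts["created_at"])
posts["text"] = posts["text"].astype(str)

# 2. Split into sub-tables by platform and by a COVID cutoff date.
COVID_CUTOFF = pd.Timestamp("2020-03-11")
subtables = {}
for platform in ["twitter", "reddit"]:
    on_platform = posts[posts["platform"] == platform]
    subtables[(platform, "before")] = on_platform[
        on_platform["created_at"] < COVID_CUTOFF]
    subtables[(platform, "after")] = on_platform[
        on_platform["created_at"] >= COVID_CUTOFF]
```

Each `(platform, period)` key then maps to one of the four sub-tables used in the comparisons below.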

TF-IDF Score Generate

This part generates the table containing the TF-IDF scores.

Helper function credit: https://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.YDmgHxPAS3I

CountVectorizer Score Generate

This part generates the table that contains the count frequencies.

General analysis

We will use all the data and compare the results from the two vectorizers.

We can see from the heatmap that the most frequent words overall are neutral, and it is hard to detect any specific topics.

We can see that the count vectorizer and TF-IDF give different frequent-word lists. Some words, such as "good" and "like", appear in both. TF-IDF surfaces more negative words, such as "damn", "disgusting", "frustrating", and "makechinapay", which are absent from the count-vectorizer results.

Hate words frequency

We mapped our predefined hate-words list onto our frequency table and found that only a small number of these words appear in the top-frequency list. Among them, "virus", "paki", and "dink" are the most frequent based on both the count vector and the TF-IDF vector. Hate words such as "makechinapay" do not have high raw counts but achieve high TF-IDF scores.
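The mapping step can be sketched as below; `freq` stands in for one of the ranked frequency tables produced earlier, and the numbers are invented for illustration.

```python
import pandas as pd

# Subset of the predefined hate-word list from the data-collection step.
hate_words = ["virus", "paki", "dink", "makechinapay"]

# `freq` stands in for a ranked term-frequency table; scores are invented.
freq = pd.Series({"good": 120, "virus": 95, "like": 80,
                  "paki": 40, "dink": 33, "makechinapay": 2})

# Keep only the hate words that actually appear in the frequency table.
top_hate = freq[freq.index.isin(hate_words)].sort_values(ascending=False)
```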

Platform Comparison

By conducting the frequency analysis on text from the two platforms separately, we find that word frequency shows distinctive platform-wise patterns. Posts on Twitter use more neutral-to-positive words, and no particular entity is mentioned significantly more than others except "virus". Posts on Reddit, on the other hand, contain more negative language (or profanity) than posts on Twitter, and many entities, such as "Trump", are mentioned.

The word cloud contains different words if we use the frequency results from TF-IDF, but the general conclusion remains similar: words on Twitter seem more neutral and feature fewer entities, while words from Reddit include many negative words and profanity.

The result may be explained by two hypotheses: first, people share less emotional text on Twitter than on Reddit; second, Twitter has a more rigorous policy for banning negative posts.

Before & After COVID Comparison

Word frequency also shows different patterns before and after the pandemic. Before the pandemic, people used more positive words: "good" and "well" were more frequent than others. After the pandemic began, the frequent-words list changed substantially; "good" and "well" were no longer used as frequently, and the topics became more scattered. Without doubt, the pandemic changed much of people's daily lives.

Before & After comparison by platform

Before the pandemic, posts on Twitter were more policy-related; people talked more about "us" and "trump". The pattern changed after the pandemic: "virus" became a common word in all posts, and more insulting language was used.

It's interesting to see that the word frequency on Reddit looks almost the same before and after. This may be due to the different regulations between the two platforms. Moreover, the topics discussed on the two platforms also vary a lot: Reddit has less policy-related content than Twitter, which may also explain why its word frequency does not change much before and after COVID.

Q2. Sentiment Analysis of the Posts

Methods:

Process and Results:

1. Use VADER to obtain the sentiment scores of the posts
2. Did the sentiment scores change over time? If so, how?

The average negativity score of the posts after COVID is higher than the score before COVID, which indicates that the posts may be trending more negative.

The average neutrality score of the posts after COVID is about the same as the score before COVID.

The average positivity score of the posts after COVID is lower than the score before COVID, which indicates that the posts may be trending less positive.

The average compound score of the posts after COVID is much more negative than the score before COVID, which again indicates that posts on the two platforms may have become more negative after COVID.

The number of posts with high negativity scores, i.e. posts that are more clearly negative, increased after COVID.

The number of posts with high neutrality scores, i.e. posts that are more clearly neutral, was higher before COVID.

The number of posts with high positivity scores, i.e. posts that are more clearly positive, decreased after COVID.

The distribution of the compound scores of the posts shows again that there were more positive posts and fewer negative posts before COVID.

3. What would the cross-platform comparison be like?

We can observe from these charts that on Reddit, the number of posts in each sentiment category does not seem to change much before and after COVID. On Twitter, by contrast, negative posts increased substantially while positive and neutral posts decreased after COVID. The changes thus show different tendencies on the two platforms. Considering that each platform has its own way of dealing with hate content, i.e. its community policy, one hypothesis is that Reddit may have done significant moderation work to remove posts deemed to be hate content.

Q3. Responses to the Contents

1. What are the responses to the negative contents on the two platforms?

We can see from the results that on Twitter, out of all 1,890 negative tweets, 950 (about 50.26%) are among the most-favorited tweets.

Similarly, out of all 1,890 negative tweets, 961 (about 50.85%) are among the most-favorited.

Between the top 200 most negative submissions and the top 200 submissions with the highest karma scores, only 7.5% overlap, which is quite different from the Twitter finding.
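The overlap check can be sketched as follows; the `submissions` table and its scores are synthetic stand-ins for the real Reddit data (with no overlap at all, by construction).

```python
import pandas as pd

# Synthetic stand-in for the Reddit submissions table.
submissions = pd.DataFrame({
    "id": range(1000),
    "neg": [i / 1000.0 for i in range(1000)],  # fake negativity scores
    "score": list(range(999, -1, -1)),          # fake karma scores
})

# Top 200 by negativity vs. top 200 by karma.
top_negative = set(submissions.nlargest(200, "neg")["id"])
top_karma = set(submissions.nlargest(200, "score")["id"])

# Percentage of submissions appearing in both top-200 lists.
overlap_pct = len(top_negative & top_karma) / 200 * 100
```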

2. Further look into the responses to contents on Reddit

Above are the bar charts of the average scores for the three sentiment categories, before and after COVID. Note that in these charts, outlier data points have been removed. We can tell from the charts that, while people did not endorse negative posts as much after COVID, they were much more willing to endorse positive and neutral posts before COVID. This finding, again, seems quite different from the Twitter finding.

Q4. Sentiment Scores and User Features

Most of the tweets came from Android users. After COVID, more iPhone and Web App users were detected.

The user location information is not reliable, but we still want to take a look at it. Most of the locations we found are in the USA.

The most popular tweet is not really directed against Asians.

Topics:

  • Before: cybersecurity concerns, animal protection, disappointing incidents, events
  • After: virus